Sample Design Features Research Articles

Using long-term direct observations in a Polytrichum-Myrtillus pine forest, we have constructed and verified a homogeneous Markov chain model for two dominant species (Vaccinium myrtillus and V. vitis-idaea) at the late stages of succession. The sampling design features a large sample size (2000 quadrats) on permanent transects, several re-examinations at 5-year intervals, and the use of species rooted frequency. As a model of the process in question, the discrete Markov chain accounts for the following four states: both species absent from the quadrat, either species present alone (two states), and both species present together; the model time step coincides with the time interval between observations. The model is calibrated on the data of two successive examinations and verified on the data of one more examination. Every possible transition between states, as well as persistence in each state, was observed in the quadrats over one time interval, which results in a complete digraph (directed graph) of transitions. Major model results are obtained by the formulae of finite Markov chain theory: the steady-state square distribution, cyclicity characteristics, and the mean durations of stages in the fine-scale dynamics. As a steady-state (stable) outcome of succession, a distribution among quadrats is expected in which 30% of quadrats are occupied by V. myrtillus alone, 11% by V. vitis-idaea alone, both species are present on 18% of quadrats, and 41% of quadrats are 'empty'. This demonstrates that V. myrtillus and V. vitis-idaea can coexist stably at the latest stages of succession, with a clear predominance of V. myrtillus, yet without competitive exclusion. 
The quantitative characteristics of cyclicity and the durations of stages in the fine-scale dynamics enable us to estimate the total duration of secondary post-fire succession as about 45 years (the time needed to reach a distribution of states that differs by less than 5% from the steady-state one). Of the four states specified, quadrats with V. vitis-idaea alone persist for the shortest time on average (8 years), while 'empty' quadrats persist for the longest (18 years). Forecasting the dynamics one model time step forward and comparing the forecast with the real square distribution revealed a difference of 5.4%. This illustrates the efficiency of the (time-)homogeneous Markov chain as a short-term forecast tool, yet leaves open the question of whether the homogeneity hypothesis holds in the longer term.
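
As a rough illustration of the kind of calculation summarized above, the sketch below computes a steady-state distribution and mean state durations for a four-state homogeneous Markov chain. The transition matrix P is hypothetical and is not the calibrated matrix from the study. The helper names are also assumptions. The mean duration uses the standard geometric sojourn-time result, 1/(1 - p_ii) steps in state i, which may differ from the exact formulae the authors applied.

```python
# Minimal sketch: steady state and mean sojourn times of a 4-state Markov chain.
# The matrix below is HYPOTHETICAL (illustration only), not the study's data.

STATES = ["empty", "V. myrtillus alone", "V. vitis-idaea alone", "both"]

# Row i holds the probabilities of moving from state i to each state in one step.
P = [
    [0.70, 0.15, 0.08, 0.07],
    [0.20, 0.60, 0.02, 0.18],
    [0.30, 0.10, 0.40, 0.20],
    [0.10, 0.30, 0.10, 0.50],
]
STEP_YEARS = 5  # model time step equals the interval between examinations


def step(dist, P):
    """Advance a distribution over states by one model time step."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def steady_state(P, tol=1e-12, max_iter=10_000):
    """Iterate from a uniform start until the distribution stops changing."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        new = step(dist, P)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    raise RuntimeError("steady state did not converge")


def mean_sojourn_years(P, step_years):
    """Expected uninterrupted time in each state: step_years / (1 - p_ii)."""
    return [step_years / (1.0 - P[i][i]) for i in range(len(P))]


pi = steady_state(P)
durations = mean_sojourn_years(P, STEP_YEARS)

for name, share, years in zip(STATES, pi, durations):
    print(f"{name}: steady-state share {share:.3f}, mean duration {years:.1f} years")
```

Because every entry of this hypothetical matrix is positive (a complete digraph, as in the abstract), the iteration converges to a unique steady state regardless of the starting distribution.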

Recent advances in statistical software1 have enabled public health researchers to fit multilevel models to a variety of outcome variables. Multilevel models facilitate inferences regarding unexplained variability among randomly sampled clusters of units (e.g., hospitals) in outcomes of interest and identify covariates that explain the variance in a given outcome at each level of a particular data hierarchy (e.g., patients within hospitals).2,3 Models with random intercepts enable researchers to accommodate correlations within higher-level units resulting from longitudinal or clustered study designs, and models with random coefficients enable researchers to identify higher-level covariates that explain between-cluster variance in relationships of interest.2,3 Public-use survey data sets collected from large national samples, such as the National Health and Nutrition Examination Survey, also have become widely available.4 The samples underlying these data sets are often “complex” in nature for 2 reasons: (1) the use of stratified multistage cluster sampling to increase sampling and cost efficiency and (2) unequal probabilities of selection from target populations for sampled elements, often as a result of oversampling of key subgroups (leading to the need to use weights for generating unbiased population estimates). Secondary analysts can accommodate these design complexities statistically by using “design-based” analyses, which ensure that population inferences are unbiased with respect to the sample design.4 However, these design-based approaches generally do not enable the types of cluster-specific inferences afforded by multilevel models,2,3 and researchers are now considering multilevel models for complex sample survey data. 
Multilevel modeling represents a “model-based” approach to survey data analysis, in which dependencies in the data introduced by complex sampling features are generally accounted for by sound specification of the underlying probability model.5,6 Advocates of this approach argue that any information contained in the sample design features should be accounted for in the model specification, making the sampling uninformative.5 However, analysts may not have access to covariates capturing all of this information. In this case, the use of weighted estimation when fitting multilevel models provides some protection against potential biases introduced by informative sampling.6 Informed by recent methodological and computational developments in this area,1–3,6,7 we show that changes in inferences are possible when fitting multilevel models to complex sample survey data and ignoring the sampling weights. We analyzed data from the 2013 Medical Monitoring Project HIV Provider Survey, sponsored by the Centers for Disease Control and Prevention, for which a probability sample of HIV care providers was selected from outpatient HIV care facilities in 16 states and Puerto Rico.8,9 Briefly, the provider survey followed a 2-stage probability-proportionate-to-size sample design, first sampling states and territories and then HIV facilities and selecting all providers within a facility. Unbiased estimation of multilevel model parameters requires the use of weights at all levels of a given data hierarchy,7 so we used previously calculated sampling weights adjusted for nonresponse at the facility level and inverses of estimated response probabilities at the provider level. We focus on only facilities with multiple responding providers and include covariates that are both theoretically relevant for the dependent variables described later in this article and related to the sampling weights (e.g., an indicator of the provider serving more than 200 patients). 
Details about computation of the Medical Monitoring Project sampling weights for both providers and facilities are available on request.10 We scaled the final provider-level weights to sum to the sample sizes within each facility. A failure to do this would overstate actual sample sizes within each higher-level unit (facility), possibly resulting in biased estimates of model parameters.2,3,7 We fit multilevel logistic regression models to 2 binary dependent variables, indicating whether the responding provider delivered adequate drug use risk reduction and sexual risk reduction services to patients (defined as delivering approximately 70% of recommended risk reduction services to most or all of the patients). The models included random intercepts to capture between-facility variation in each proportion, in addition to fixed effects of several provider- and facility-level covariates of interest. We fit these models with the new GLIMMIX command11 in SAS/STAT version 13.1 (SAS Institute, Cary, NC), which can fit multilevel models to complex sample survey data. Identical results can be obtained with the new svy: melogit command in Stata version 14 (StataCorp LP, College Station, TX). We did not test whether the parameter differences in the weighted and unweighted models were significant,12 but we did observe several shifts in inference when using weighted estimation (Table A; available as a supplement to the online version of this article at http://www.ajph.org). In both models, the intercept became more negative and significant, suggesting that the probability of using adequate risk reduction was being overstated for the type of provider represented by zeroes on all of the covariates (which may not be entirely meaningful in all models). For drug risk reduction, the coefficient for delivering care in a language other than English became nonsignificant. 
For the sexual risk reduction outcome, the male provider coefficient became significant, and the Black provider, nurse practitioner, and integrated team effects became even stronger. Finally, the estimated variability of the random facility intercepts was clearly being overstated when ignoring the weights, and the weighted models explained more of the variance in the outcomes at each level. The weights at each level were clearly informative about the parameters defining these models, and ignoring them in analysis would have led to erroneous inferences with respect to the sample design used. Notably, these results held despite the inclusion of available covariates related to the sampling weights in the models. In practice, covariates used to compute the weights or the weights at each level of the data hierarchy may not be available to the public, making appropriate design-adjusted estimation of multilevel models difficult or impossible. We encourage analysts fitting multilevel models to survey data to carefully examine the variables available for weighted estimation in these data sets, make use of the powerful software1–3,11 that has been developed in this area, and (when possible) examine whether weighted estimation or adjustment for covariates related to the weights affects their inferences.
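
The within-facility weight scaling described above (provider-level weights rescaled to sum to the facility's sample size) can be sketched as follows. This is a minimal illustration under assumed inputs: the function name, the record layout, and the example weights are hypothetical and do not come from the Medical Monitoring Project data.

```python
from collections import defaultdict


def scale_weights_to_cluster_size(records):
    """Rescale level-1 weights so they sum to the sample size in each cluster.

    records: list of (cluster_id, raw_weight) pairs, one per sampled unit.
    Returns the scaled weights in the same order as the input.
    """
    totals = defaultdict(float)  # sum of raw weights per cluster
    counts = defaultdict(int)    # number of sampled units per cluster
    for cluster_id, weight in records:
        totals[cluster_id] += weight
        counts[cluster_id] += 1
    # scaled weight = raw weight * (cluster sample size / cluster weight total)
    return [w * counts[cid] / totals[cid] for cid, w in records]


# Hypothetical example: two facilities with 2 and 3 responding providers.
records = [("A", 2.0), ("A", 6.0), ("B", 1.5), ("B", 1.5), ("B", 3.0)]
scaled = scale_weights_to_cluster_size(records)
print(scaled)
```

After scaling, the weights within each facility sum to that facility's number of respondents, so relative weighting among providers is preserved without inflating the apparent within-facility sample size.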

Articles published on Sample Design Features

Mammography Screening Outreach Through Non-Primary Care–Based Services

Spsurvey: Spatial Sampling Design and Analysis in R.

Influencing factors of using Korean Medicine services – focusing on the 2017 Korean Medicine Utilization Survey

Leveraging Emergency Department Encounters to Improve Cancer Screening Adherence.

Estimation and correction of bias in network simulations based on respondent-driven sampling data

Power to detect trends in abundance within a distance sampling framework

Comparative study on the complex samples design features using SPSS Complex Samples, SAS Complex Samples and WesVarPc

Analyzing the Fine-Scale Dynamics of Two Dominant Species in a Polytrichum–Myrtillus Pine Forest. I. A Homogeneous Markov Chain and Cyclicity Indices

How Big of a Problem is Analytic Error in Secondary Analyses of Survey Data?

Multiple Imputation in Two-Stage Cluster Samples Using The Weighted Finite Population Bayesian Bootstrap.

Weighted Multilevel Models: A Case Study.

Generating synthetic data to produce public-use microdata for small geographic areas based on complex sample survey data with application to the National Health Interview Survey

An inconvenient dataset: bias and inappropriate inference with the multilevel model

Fish dispersal in fragmented landscapes: a modeling framework for quantifying the permeability of structural barriers

Inverse Association between Fruit and Vegetable Intake and BMI even after Controlling for Demographic, Socioeconomic and Lifestyle Factors

Sampling designs for accuracy assessment of land cover

Combining Information From Two Surveys to Estimate County-Level Prevalence Rates of Cancer Risk Factors and Screening

Complex sample design effects and inference for mental health survey data
